1 Description

1.1 Clinical Context

Computed Tomography scanner (CT scan) is a widely spread and popular exam in oncology: it reflects the density of the tissues of the human body. It is, then, adapted to the study of lung cancer because lungs are mostly filled with air (low density) while tumors are made of dense tissues.

1.2 Clinical context

Small Cell Lung Cancer can itself be split into four major subtypes based on histology observations: squamous cell carcinoma, large cell carcinoma, adenocarcinoma and a mixture of all

1.3 Goal

Predict the survival time of a patient (remaining days to live) from one three-dimensional CT scan (grayscale image) and a set of pre-extracted quantitative imaging features, as well as clinical data.

1.4 dataset

To each patient corresponds one CT scan, and one binary segmentation mask. The segmentation mask is a binary volume of the same size as the CT scan, except that it is composed of zeroes everywhere there is no tumour, and 1 otherwise. The CT scans and the associated segmentation masks are subsets of two public datasets:

  • NSCLC Radiomics (subset of 285 patients)
  • NSCLC RadioGenomics(subset of 141 patients)

Both training and validation contain for each patient, the time to event (days), as well as the censorship. Censorship indicates whether the event (death) was observed or whether the patient escaped the study: this can happen when the patient’s track was lost, or if the patient died of causes not related to the disease.

2 Setting python version and anaconda environment for R :-)

## python:         /Users/Mezhoud/anaconda3/bin/python3
## libpython:      /Users/Mezhoud/anaconda3/lib/libpython3.7m.dylib
## pythonhome:     /Users/Mezhoud/anaconda3:/Users/Mezhoud/anaconda3
## version:        3.7.5 (default, Oct 25 2019, 10:52:18)  [Clang 4.0.1 (tags/RELEASE_401/final)]
## numpy:          /Users/Mezhoud/anaconda3/lib/python3.7/site-packages/numpy
## numpy_version:  1.17.3
## 
## NOTE: Python version was forced by use_python function

3 Explore scans and masks of Tumor lung cancer

## the dimension of scan array is:  (92, 92, 92)
## the dimension of mask array is:  (92, 92, 92)
## plot some images from patient 002:

3.1 Function to plot multiple image from array

The plot shows colored images scan of 6 slides. At this step it is not easy to distinguish the tumor.

The dataset has aslo the masks for each scan slide which locate the position of the tumor in the scan.

  • The first 3 slides do not have tumor streak, however the next 3 ones indicate the position of the tumor in red color.
  • If we plot more slides, we can observe the increase of the size of the tumor during plotting slides.
  • At the end the size the Tumor is decreasing.

  • We can note that the crop is adjusted to the size of the tumor

If we compare with the scan slides, we obtain:

  • It is always not easy to delimit the tumor in scan images

  • Comparing to masks, we can note that, between scan 34 and scan 65, the slides have more yellow stain or less blue color.

4 Explore images from test dataset

Plot mask slides from test dataset

5 Import image from python environment to R

The goal of this step is to convert image matrices as vector. So, each image can be ranged in one row. Finally, we can obtain one dataframe with 92 rows (images) ofr each sample (patient).

5.1 Import useful R packages

5.3 Understanding the structure of the array of images

## [1] "One image is a: matrix"
## [1] "Two images are an: array"
## [1] "Print the first 10 pixels of Scan N°1: "
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] -777 -759 -707 -697 -749 -796 -826 -837 -858  -860
##  [2,] -783 -791 -774 -768 -787 -808 -827 -826 -829  -829
##  [3,] -804 -841 -827 -812 -820 -840 -831 -801 -792  -794
##  [4,] -830 -857 -839 -816 -805 -818 -801 -764 -734  -722
##  [5,] -844 -854 -843 -823 -799 -787 -771 -743 -704  -670
##  [6,] -844 -849 -844 -831 -821 -810 -809 -791 -760  -719
##  [7,] -848 -847 -848 -844 -847 -841 -849 -841 -829  -806
##  [8,] -847 -856 -854 -846 -840 -813 -826 -849 -848  -836
##  [9,] -840 -851 -836 -823 -835 -796 -803 -853 -857  -833
## [10,] -847 -841 -829 -817 -860 -832 -799 -842 -865  -838
## [1] "Print the first 10 pixels of Mask N°1: "
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
##  [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

5.5 Convert image matrix to vector

## [1] "The output is a: list"
## [1] "With length of: 2"
## [1] "The names of two elements are: "
## [1] "patient_002_scan" "patient_002_mask"
## [1] "which are: data.frame"
## [1] "The dimension of each dataframe is: "
## [1]   92 8464
  • At this step we stop exploring scan and mask.

  • We think to use only masks for modeling

  • Potential method: keras, mxnet

6 Exploratory Data Analysis of radiomics and clinical data

##           V1                          V2                          V3
## 1:                                 shape                       shape
## 2:           original_shape_Compactness1 original_shape_Compactness2
## 3: PatientID                                                        
## 4:       202                 0.027815034                 0.274891585
## 5:       371                  0.02301549                 0.188210005
## 6:       246                 0.027348106                 0.265739895
##                                  V4                                    V5
## 1:                            shape                                 shape
## 2: original_shape_Maximum3DDiameter original_shape_SphericalDisproportion
## 3:                                                                       
## 4:                      48.55924217                           1.537964054
## 5:                      75.70336849                           1.744961158
## 6:                      70.43436661                           1.555420243
##                           V6                         V7
## 1:                     shape                      shape
## 2: original_shape_Sphericity original_shape_SurfaceArea
## 3:                                                     
## 4:               0.650210255                 5431.33321
## 5:               0.573078659                10369.56873
## 6:               0.642913068                10558.81869
##                                   V8
## 1:                             shape
## 2: original_shape_SurfaceVolumeRatio
## 3:                                  
## 4:                       0.275227763
## 5:                       0.240726824
## 6:                       0.200765988
##    PatientID               Histology Mstage Nstage SourceDataset Tstage     age
## 1:       202          Adenocarcinoma      0      0            l2      2 66.0000
## 2:       371              large cell      0      2            l1      4 64.5722
## 3:       246 squamous cell carcinoma      0      3            l1      2 66.0452
## 4:       240                     nos      0      2            l1      3 59.3566
## 5:       284 squamous cell carcinoma      0      3            l1      4 71.0554
## 6:       348 squamous cell carcinoma      0      2            l1      2 65.0212

The radiomics features can be divided into 4 groups as follows (shown in row 1): - Group 1. First order statistics - Group 2. Shape and size based features - Group 3. Textural features - Group 4. Wavelet features

Each group can be subset into several sub-groups shown in row 2 of the radiomics dataset. To make the radiomics features numeric dataset we need to remove the two first rows and convert them to colnames.

##    Groups                              Features
## V2  shape           original_shape_Compactness1
## V3  shape           original_shape_Compactness2
## V4  shape      original_shape_Maximum3DDiameter
## V5  shape original_shape_SphericalDisproportion
## V6  shape             original_shape_Sphericity
## V7  shape            original_shape_SurfaceArea

To improve the esthetic of the dataframe, we note:

6.2 New Colnames of radiomics

## [1] "shape_Compactness1"           "shape_Compactness2"          
## [3] "shape_Maximum3DDiameter"      "shape_SphericalDisproportion"
## [5] "shape_Sphericity"             "shape_SurfaceArea"

6.3 Get new radiomics style

##   PatientID shape_Compactness1 shape_Compactness2 shape_Maximum3DDiameter
## 1       202         0.02781503          0.2748916                48.55924
## 2       371         0.02301549          0.1882100                75.70337
## 3       246         0.02734811          0.2657399                70.43437
## 4       240         0.02681111          0.2554064                46.81880
## 5       284         0.02369124          0.1994242                53.79591
## 6       348         0.03098136          0.3410383                63.74951
##   shape_SphericalDisproportion shape_Sphericity shape_SurfaceArea
## 1                     1.537964        0.6502103          5431.333
## 2                     1.744961        0.5730787         10369.569
## 3                     1.555420        0.6429131         10558.819
## 4                     1.576120        0.6344693          4221.412
## 5                     1.711620        0.5842418          5295.900
## 6                     1.431305        0.6986630          8493.134
##   shape_SurfaceVolumeRatio
## 1                0.2752278
## 2                0.2407268
## 3                0.2007660
## 4                0.3238780
## 5                0.3272407
## 6                0.1976017

6.5 The first principal component order of the features

6.6 The hierarchical clustering order of the features

  • We can return to these heatmap when we predict the most importante features using modeling.

6.7 Explore Clinical data

  • The most frequente cases is Adenocarcinoma followed by Aquamous Cell Carcinoma.

  • NOS: not otherwise specified

  • It seems NOS and Nsclc Nos corespond to the same category

  • Nan is not available ?

  • The density plot shows that the must frequent cases are 65, 70, 72 years old.

_ The most frequent Nstage class is also 0, followed by 2, 3, and 1.

  • The third plots shows that the most cases are in Mstage == 0. We can focus only in this class.

  • There are two sources of dataset.

6.8 Explore output_train and output_test

##    PatientID SurvivalTime Event
## 1:       202         1378     0
## 2:       371          379     1
## 3:       246          573     1
## 4:       240          959     0
## 5:       284         2119     0
## 6:       348          706     1
##    PatientID SurvivalTime Event
## 1:        13     788.4177   NaN
## 2:       155     427.6501   NaN
## 3:       404     173.5872   NaN
## 4:       407     389.8780   NaN
## 5:         9    1580.7672   NaN
## 6:        49     472.5234   NaN
  • The goal is the fill Event variable in output_test by 0 or 1.

7 Train data preparation: Merge clinical, radimoics, and output_train dataset

##    PatientID Event Histology Mstage Nstage SourceDataset Tstage     age
## 1:       202     0         1      0      0             2      2 66.0000
## 2:       371     1         2      0      2             1      4 64.5722
## 3:       246     1         6      0      3             1      2 66.0452
## 4:       240     0         4      0      2             1      3 59.3566
## 5:       284     0         6      0      3             1      4 71.0554
## 6:       348     1         6      0      2             1      2 65.0212
##    SurvivalTime shape_Compactness1
## 1:         1378         0.02781503
## 2:          379         0.02301549
## 3:          573         0.02734811
## 4:          959         0.02681111
## 5:         2119         0.02369124
## 6:          706         0.03098136

7.1 Explore missing value in train

  • There are 18 missing age from 300.

8 Test data preparation: Merge clinical, radimoics, and output_test dataset

8.1 New_radiomics_test

##   PatientID shape_Compactness1 shape_Compactness2 shape_Maximum3DDiameter
## 1        13         0.02888522         0.29645143               106.90182
## 2       155         0.03194837         0.36266005                18.81489
## 3       404         0.01599883         0.09094503               105.08092
## 4       407         0.03135766         0.34937318                46.96807
## 5         9         0.01781454         0.11275905                56.54202
## 6        49         0.03816202         0.51744596                20.12461
##   shape_SphericalDisproportion shape_Sphericity shape_SurfaceArea
## 1                     1.499738        0.6667830        29085.5414
## 2                     1.402276        0.7131265          629.4436
## 3                     2.223687        0.4497036        12509.2654
## 4                     1.419832        0.7043089         4067.6574
## 5                     2.069901        0.4831149         7093.3657
## 6                     1.245599        0.8028264          844.2344
##   shape_SurfaceVolumeRatio
## 1                0.1145278
## 2                0.7038788
## 3                0.3152977
## 4                0.2821040
## 5                0.3760316
## 6                0.5088176

8.3 Merge clinical, radimoics, and output_test dataset

##    PatientID Event Histology Mstage Nstage SourceDataset Tstage     age
## 1:        13   NaN         4      0      0             1      4 44.3970
## 2:       155   NaN         1      0      3             1      1 63.3183
## 3:       404   NaN         2      0      2             1      2 64.7255
## 4:       407   NaN         4      0      0             1      2 65.3635
## 5:         9   NaN         1      0      0             2      2 50.0000
## 6:        49   NaN         6      0      0             1      2 86.1410
##    SurvivalTime shape_Compactness1
## 1:     788.4177         0.02888522
## 2:     427.6501         0.03194837
## 3:     173.5872         0.01599883
## 4:     389.8780         0.03135766
## 5:    1580.7672         0.01781454
## 6:     472.5234         0.03816202

8.4 Explore missing value in test

  • There are 4 missing age from 125.

9 Scaling Train and Test dataset

##           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells 2315329 123.7    3851154 205.7         NA  3851154 205.7
## Vcells 6522974  49.8   12255594  93.6     102400 10146320  77.5

10 Preprocessing for modeling

10.1 Split Train dataset into Train & Valid sets

## <241/59/300>
  • We can retrieve our training and testing sets using training() and testing() functions.
##     Event  Histology     Mstage     Nstage SourceDataset     Tstage        age
##  1:     0 -1.0235305 -0.1207812 -0.8392371     1.4175490 -0.1386218 -0.2504355
##  2:     1 -0.5217999 -0.1207812  0.8392371    -0.7037831  1.7024495 -0.3972837
##  3:     1  1.4851227 -0.1207812  1.6784742    -0.7037831 -0.1386218 -0.2457867
##  4:     0  1.4851227 -0.1207812  1.6784742    -0.7037831  1.7024495  0.2695086
##  5:     1  1.4851227 -0.1207812  0.8392371    -0.7037831 -0.1386218 -0.3511044
##  6:     1 -1.0235305 -0.1207812 -0.8392371     1.4175490 -1.0591575  0.1609615
##  7:     1 -1.0235305 -0.1207812 -0.8392371    -0.7037831  1.7024495  0.6186818
##  8:     1 -1.0235305 -0.1207812  0.8392371    -0.7037831  1.7024495 -1.5788159
##  9:     0 -0.5217999 -0.1207812  0.8392371    -0.7037831 -0.1386218 -2.0372357
## 10:     1  1.4851227 -0.1207812  0.8392371     1.4175490 -0.1386218  1.1894541
##     SurvivalTime shape_Compactness1 shape_Compactness2
##  1:    0.7487397          0.3273910          0.2206572
##  2:   -0.7068626         -0.4508739         -0.5451008
##  3:   -0.4241931          0.2516768          0.1398098
##  4:    1.8284207         -0.3412982         -0.4460329
##  5:   -0.2304042          0.8408228          0.8050072
##  6:    0.7356262         -0.8618943         -0.8911634
##  7:   -0.9720474          0.3619137          0.2579747
##  8:   -0.9735045         -1.1924952         -1.1402475
##  9:    0.3815607          0.9177119          0.8979347
## 10:   -0.7345467         -0.0934834         -0.2114097

10.2 Format train and test to DMatrix

## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice

10.3 Optimize features with Cross validation

Here, we can see after how many rounds, we achieved the smallest test error.

## [1]  train-auc:0.725368+0.064529 test-auc:0.673917+0.082949 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 1000 rounds.
## 
## [1001]   train-auc:0.857767+0.010216 test-auc:0.717012+0.070853 
## Stopping. Best iteration:
## [18] train-auc:0.813827+0.023060 test-auc:0.718815+0.079673
## Time difference of 3.316236 secs

11 Train the model

## [1]  train-auc:0.685120  eval-auc:0.663793 
## [18] train-auc:0.796706  eval-auc:0.756322
## Time difference of 0.03550601 secs

11.1 Predict valid_2 dataset

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4970  0.4980  0.5001  0.5003  0.5024  0.5041
  • We suppose that if Prob > 0.5 (Median), the Event is 1, else 0

11.2 Transform propability to binary classification

## pred_bin
##  0  1 
## 27 32

11.3 Confusion matrix for Tree model

## # A tibble: 4 x 3
##   prediction label     n
##        <dbl> <dbl> <int>
## 1          0     0    18
## 2          0     1     9
## 3          1     0    11
## 4          1     1    21

12 Extract the most important features from tree xgboost model

12.1 List the most important features

##                                 Feature        Gain       Cover  Frequency
##  1:             glcm_MaximumProbability 0.112158754 0.124971897 0.11320755
##  2:                       glcm_Contrast 0.104582668 0.082635647 0.09433962
##  3:                        SurvivalTime 0.094105432 0.053882488 0.05660377
##  4:               firstorder_Uniformity 0.061805311 0.026488328 0.01886792
##  5:                   glcm_JointEntropy 0.052352252 0.058184483 0.05660377
##  6:                       SourceDataset 0.050837990 0.053656341 0.03773585
##  7:                   firstorder_Energy 0.047711754 0.058636688 0.05660377
##  8:                            glcm_Idn 0.045711015 0.068372428 0.07547170
##  9:                 glrlm_RunPercentage 0.044887643 0.038260811 0.03773585
## 10:              glcm_DifferenceEntropy 0.043768600 0.030563903 0.01886792
## 11:              glrlm_ShortRunEmphasis 0.040857945 0.026714878 0.01886792
## 12:               glrlm_LongRunEmphasis 0.037738614 0.033054223 0.03773585
## 13:                   shape_VoxelVolume 0.025881372 0.018111877 0.01886792
## 14:                  firstorder_Maximum 0.025577376 0.037129306 0.03773585
## 15:                  firstorder_Minimum 0.025106352 0.017432659 0.03773585
## 16:                              Nstage 0.024500351 0.030337327 0.01886792
## 17: glrlm_ShortRunHighGrayLevelEmphasis 0.019625924 0.028979068 0.01886792
## 18:        firstorder_StandardDeviation 0.018326938 0.019923085 0.01886792
## 19:                   shape_SurfaceArea 0.017815092 0.021960318 0.01886792
## 20:                   glcm_ClusterShade 0.015217592 0.011999033 0.01886792
## 21:                            glcm_Idm 0.015101068 0.019696527 0.01886792
## 22:                glcm_InverseVariance 0.014044755 0.009508703 0.01886792
## 23:  glrlm_ShortRunLowGrayLevelEmphasis 0.012071897 0.011999098 0.01886792
## 24:                glcm_Autocorrelation 0.010722093 0.023771927 0.01886792
## 25:                              Tstage 0.010224976 0.019923104 0.01886792
## 26:                                 age 0.008165652 0.019696701 0.01886792
## 27:                 firstorder_Kurtosis 0.007529256 0.021507924 0.01886792
## 28:                  shape_Compactness2 0.005142338 0.018338204 0.01886792
## 29:                             glcm_Id 0.004520085 0.008150340 0.01886792
## 30:              glcm_DifferenceAverage 0.003908905 0.006112684 0.01886792
##                                 Feature        Gain       Cover  Frequency
  • Survival Time is the most important feature, followed by age, and 8 texture description.

13 Prediction

13.2 Prediction with Tree xgboost model

## [1] 0.5031433 0.4991213 0.5023177 0.5022252 0.4970879 0.4981019

13.3 summarize probabilities of Events

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4969  0.4991  0.5008  0.5006  0.5024  0.5040